Genome Medicine

Springer Science and Business Media LLC

Preprints posted in the last 30 days, ranked by how well they match Genome Medicine's content profile, based on 154 papers previously published here. The average preprint has a 0.32% match score for this journal, so anything above that is already an above-average fit.

1
Clinical evidence yield as a framework for evaluating computational predictors and multiplexed assays of variant effect

Shang, Y.; Badonyi, M.; Marsh, J. A.

2026-03-30 bioinformatics 10.64898/2026.03.27.714777 medRxiv
Top 0.1%
37.1%
Interpreting the clinical significance of missense variants of uncertain significance (VUS) remains a major challenge in clinical genetics. Although computational variant effect predictors (VEPs) and multiplexed assays of variant effect (MAVEs) can generate large-scale functional scores, their value is typically assessed using discrimination metrics such as AUROC rather than by the strength of evidence they provide under ACMG/AMP guidelines. Here, we introduce mean evidence strength (MES), a quantitative metric that summarises the pathogenic and benign evidence assigned across missense variants following gene-level Bayesian calibration. Using the acmgscaler framework, we calibrated 12 population-free VEPs across 367 disease genes and analysed 15 MAVE datasets with sufficient clinical data. MES revealed important discrepancies with AUROC, including cases where methods with similar discrimination differed substantially in evidence yield. MAVEs achieved high average MES despite lower AUROC, while several VEPs showed strong discrimination but more limited calibrated evidence. Among predictors, CPT-1 achieved the highest MES and provided moderate or stronger evidence for the largest fraction of ClinVar VUS. MES therefore provides a practical framework for evaluating computational and experimental variant effect datasets in terms of calibrated clinical evidence yield.

2
Methylation profiling in the Million Veteran Program: design, quality control, and smoking-associated epigenetic signatures

Schreiner, P. A.; Markianos, K.; Francis, M.; Despard, B.; Gorman, B. R.; Said, I.; Dong, F.; Gautam, S.; Dochtermann, D.; Shi, Y.; Devineni, P.; Kirkpatrick, C.; Khazanov, N.; Moser, J.; Million Veteran Program; Huang, G. D.; Muralidhar, S.; Tsao, P. S.; Pyarajan, S.

2026-04-23 genetic and genomic medicine 10.64898/2026.04.22.26351491 medRxiv
Top 0.1%
34.5%
The Million Veteran Program (MVP) represents the largest and one of the most diverse single cohorts associated with longitudinal Electronic Health Record (EHR) data. We profiled a subset of samples from MVP using the Illumina Infinium MethylationEPIC BeadChip (EPIC array) to generate one of the largest single-cohort methylation datasets to date. Methylation profiles were analyzed for 45,460 total individuals, with the most populous ancestries composed of 27,455 Europeans, 11,798 African Americans, and 4,859 Admixed Americans. We detail the strict quality control standards implemented to ensure robust methylation profiling of the MVP cohort. This dataset was then applied to evaluate the effects of smoking exposure on DNA methylation in MVP participants. Ancestry-stratified epigenome-wide association studies (EWAS) of smoking status (ever/never) were performed using over 750,000 probes with certifiable signal. Our multi-ancestry meta-analysis demonstrates replicability with existing EWAS and identifies 3,207 novel probe-smoking associations unlocked via the depth and breadth of data in this cohort.

3
HLA-Resolve: High-Resolution HLA Haplotyping Using Long-Read Hybrid Capture

Glasenapp, M. R.; Yee, M.-C.; Symons, A. E.; Cornejo, O. E.; Garcia, O. A.

2026-03-30 genetic and genomic medicine 10.64898/2026.03.27.26349549 medRxiv
Top 0.1%
33.8%
Accurate HLA typing is critical for transplantation, pharmacogenomics, and disease risk prediction, yet short-read approaches cannot resolve the HLA region's extreme polymorphism. Long-read sequencing improves resolution, but its adoption has been limited by higher cost, reduced base accuracy, limited throughput, and reliance on long-range PCR. To overcome these limitations, we present a multiplexed long-read hybrid capture workflow for PacBio and Oxford Nanopore sequencing that enriches all classical HLA loci and the complete HLA Class III region. A single-step enzymatic fragmentation and barcoding strategy enables automated library prep. We also introduce HLA-Resolve, an HLA typing program optimized for HiFi reads, and validate workflow performance against the Genome in a Bottle, Human Pangenome Reference Consortium, and International Histocompatibility Working Group benchmarks using 32 geographically diverse samples. These advances offer a cost-effective approach for high-resolution HLA typing with clinical applicability and enable investigation of the role of HLA Class III variation in disease.

4
Calibration of in-frame indel variant effect predictors for clinical variant classification

Abderrazzaq, H.; Singh, M.; Babb, L.; Bergquist, T.; Brenner, S. E.; Pejaver, V.; O'Donnell-Luria, A.; Radivojac, P.; ClinGen Computational Working Group; ClinGen Variant Classification Working Group

2026-04-18 bioinformatics 10.64898/2026.04.15.718599 medRxiv
Top 0.1%
33.4%
Insertions and deletions (indels) represent a substantial source of genetic variation in humans and are associated with a diverse array of functional consequences. Despite their prevalence and clinical importance, indels, particularly short in-frame indels, remain critically understudied compared to single nucleotide variants and are challenging to interpret clinically. While many computational predictors for missense variants have been rigorously evaluated and calibrated for clinical use, the clinical utility of tools for in-frame indels remains uncertain. To address this gap, we have calibrated in-frame indel prediction tools for clinical variant classification. We constructed a high-confidence dataset of in-frame indel variants (≤ 50 bp) from clinical and population databases and estimated the prior probability of pathogenicity of a rare in-frame indel observed in a disease-associated gene, and of an insertion and a deletion separately. Using a previously developed statistical framework based on local posterior probabilities, we then established score thresholds for eight computational tools, corresponding to distinct evidence levels for pathogenic and benign classification according to ACMG/AMP guidelines. All in-frame indel predictors evaluated here reached multiple evidence levels of pathogenicity and/or benignity, demonstrating measurable clinical value. However, these models consistently exhibited lower performance levels compared to missense predictors, highlighting the need for improved computational approaches for indel classification.
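The step from a local posterior probability to an ACMG/AMP evidence level can be illustrated with a minimal sketch. It assumes the widely used Bayesian reading of the guidelines (Tavtigian et al., 2018), in which "very strong" evidence corresponds to odds of pathogenicity of 350 and each weaker level is a root of that value. The default prior of 0.10 comes from that framework and is NOT the prior estimated in this preprint, which the abstract does not report. All function names are hypothetical.

```python
# Sketch only: posterior thresholds per evidence level under the
# Tavtigian et al. (2018) Bayesian framework. The prior is an assumption;
# the preprint estimates its own prior for in-frame indels.

VERY_STRONG_ODDS = 350.0

# Each level's odds of pathogenicity is VERY_STRONG_ODDS raised to this power.
# Ordered from strongest to weakest so the first match is the strongest level.
LEVEL_EXPONENTS = [
    ("very_strong", 1.0),
    ("strong", 1.0 / 2.0),
    ("moderate", 1.0 / 4.0),
    ("supporting", 1.0 / 8.0),
]


def posterior_from_odds(odds: float, prior: float) -> float:
    """Posterior probability of pathogenicity given odds and a prior."""
    return odds * prior / ((odds - 1.0) * prior + 1.0)


def pathogenic_thresholds(prior: float = 0.10) -> dict:
    """Minimum posterior needed to reach each pathogenic evidence level."""
    return {
        level: posterior_from_odds(VERY_STRONG_ODDS ** exponent, prior)
        for level, exponent in LEVEL_EXPONENTS
    }


def strongest_pathogenic_level(posterior: float, prior: float = 0.10):
    """Return the strongest level whose threshold the posterior meets, or None."""
    thresholds = pathogenic_thresholds(prior)
    for level, _ in LEVEL_EXPONENTS:
        if posterior >= thresholds[level]:
            return level
    return None


if __name__ == "__main__":
    for level, value in pathogenic_thresholds().items():
        print(f"{level:12s} posterior >= {value:.3f}")
```

A calibration like the one described then searches each tool's score range for the cut-offs at which the estimated local posterior first crosses these thresholds. Benign thresholds follow the same pattern with the odds inverted.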

5
Benchmarking scRNA-seq Copy Number Inference: A Comprehensive Evaluation and Practitioner Guide

Chang, H.-C.; Shi, Y.; Cheng, H.; Zou, J.; Chang, A. C.-C.; Schlegel, B. T.; Wang, W.; Brown, D. D.; Chen, F.; Wang, S.; Li, D.; Sai, R.; Michel, N.; Oesterreich, S.; Lee, A. V.; Tseng, G. C.

2026-04-15 cancer biology 10.64898/2026.04.12.718050 medRxiv
Top 0.1%
23.4%
Accurately inferring copy number variation (CNV) from scRNA-seq data is critical for identifying malignant cells, reconstructing tumor subclonal architecture, and uncovering the genomic drivers that dictate cancer cell biology. However, the performance of existing tools varies significantly, and current benchmarks lack the breadth of datasets and methods necessary to provide definitive guidance. We present a comprehensive benchmark of 12 CNV inference methods across 28 real datasets (>100,000 cells) and diverse synthetic datasets. By evaluating methods based on malignant cell classification accuracy, CNV inference accuracy, scalability, and robustness, we establish a definitive practitioner's guideline: allele-aware methods like Numbat excel when high-quality allelic inference can be achieved, whereas expression-centric tools such as Clonalscope, CopyKAT, inferCNV, and SCEVAN remain reliable when raw sequencing data are unavailable. Our study provides both a practical decision-making framework for researchers and a public repository of standardized CNV profiles to catalyze further methodological innovation.

6
Berrylyzer-an Efficient, Traceable, and Lightweight Intelligent Agentic System for Prenatal Genetic Diagnosis

Meng, M.; Liu, L.; Du, Q.; Zhou, X.; Tian, Y.; Sun, K.; Li, N.; Zhang, P.; Lian, X.; Fan, N.; Zhu, N.; Li, S.; Mao, A.; Li, Y.; Zou, G.

2026-04-04 genetic and genomic medicine 10.64898/2026.04.02.26349929 medRxiv
Top 0.1%
23.3%
Background: Artificial intelligence (AI)-driven variant prioritization has demonstrated substantial utility in expediting genetic diagnosis by ranking the most likely causative variants. While a variety of tools have been developed, few address the unique clinical and technical constraints in prenatal genetic diagnosis. Methods: We introduce Berrylyzer, a novel, end-to-end variant prioritization system applied to prenatal diagnosis. Inspired by the clinician's reasoning process during variant interpretation, Berrylyzer applies a modular, stepwise scoring architecture that jointly integrates phenotypic and genomic evidence and delivers a ranked list of candidate variants, achieving high computational efficiency without compromising analytical rigor. Moreover, Berrylyzer natively supports both structured ontologies and free-text clinical narratives, enabling flexible integration into diverse clinical environments. Its performance was rigorously evaluated across two independent, real-world prenatal cohorts and benchmarked against three state-of-the-art methods: Xrare, Exomiser, and PhenIX. Results: Across the two datasets, Berrylyzer ranked 56.41% and 58.12% of diagnostic variants first, and achieved recall rates of 94.02% and 97.42% within the top 20, respectively. Berrylyzer outperformed Xrare (85.19% and 87.08%), Exomiser (84.90% and 85.98%), and PhenIX (82.05% and 88.93%). Stratified analysis consistently demonstrated superior performance across diverse disease categories, inheritance patterns, and analytical strategies. Notably, Berrylyzer exhibited robustness regardless of phenotype form, yielding comparable top-20 recall rates for free-text descriptions and standardized terminologies. Conclusion: Berrylyzer represents an accurate, interpretable, and computationally lightweight variant prioritization system for prenatal genetic diagnosis. Its superior performance across heterogeneous diagnostic contexts makes it a practical solution for seamless integration into clinical pipelines, thereby advancing precision medicine in prenatal settings.

7
PAVS: A Standardized Database of Phenotype-Associated Variants from Saudi Arabian Rare Disease Patients

Abdelhakim, M.; Althagafi, A.; Schofield, P.; Hoehndorf, R.

2026-04-06 genetic and genomic medicine 10.64898/2026.04.05.26350189 medRxiv
Top 0.1%
23.1%
Genotype-phenotype databases are essential for variant interpretation and disease gene discovery. Genetic variation differs among human populations, mainly in allele frequencies and haplotype patterns shaped by ancestry and demographic history. Population-specific genotypes can influence traits and disease risk; this makes population-specific characterization important. Most existing resources focus on the characterization of a population's genetic background, but do not represent the resulting phenotypes. We have developed PAVS (Phenotype-Associated Variants in Saudi Arabia), a curated, publicly accessible database that integrates 5,132 Saudi clinical cases from four Saudi cohorts and 522 cases from analysis of a mixed-population cohort, together with 1,856 cases from the Deciphering Developmental Disorders study (DDD) and 9,588 literature phenopackets. Each case record describes patient-level phenotypes, encoded with the Human Phenotype Ontology (HPO), and links them to genomic variants, gene identifiers, zygosity, pathogenicity classifications, and disease diagnoses mapped to standardized disease terminologies. The data are represented in Phenopackets format and as a knowledge graph in RDF. Additionally, a web interface provides phenotype-based similarity search, gene and variant browsers, and an HPO hierarchy explorer. We evaluate the utility of the phenotype annotations for gene prioritization using semantic similarity. While there are clear differences from global literature-curated databases, phenotypes in PAVS can successfully rank the correct gene highly (ROC AUC: 0.89). PAVS addresses a gap in population-specific genotype-phenotype resources and provides a benchmark for phenotype-driven variant prioritization in under-represented populations.

8
Paired wastewater and clinical genomics across metropolitan and hospital catchments reveals SARS-CoV-2 relevant mutations

Ruiz-Rodriguez, P.; Sanz-Carbonell, A.; Perez-Cataluna, A.; Cano-Jimenez, P.; Ruiz-Roldan, L.; Alandes, R.; Valiente-Mullor, C.; Gimeno, C.; Comas, I.; Sanchez, G.; Gonzalez-Candelas, F.; Coscolla, M.

2026-04-06 epidemiology 10.64898/2026.03.31.26346553 medRxiv
Top 0.1%
22.9%
Wastewater (WW) genomics can track SARS-CoV-2 circulation beyond clinical testing, but its ability to reflect clinical diversity and capture severity-linked mutations remains unclear. Here, we integrated 845 clinical genomes and 22 wastewater genomes from Valencia, Spain, across matched metropolitan and hospital catchments. We compared matched WW and clinical sequencing for lineage and mutation surveillance at two levels: metropolitan and hospital. Then, we tested WW sensitivity to detect mutations statistically associated with hospitalization status in regional (n = 4,843), national (n = 10,052) and supranational (n = 39,099) clinical datasets. WW surveillance captured the dominant Omicron background when collapsing lineages into parental lineages constellation but had limited sensitivity for fine-scale sublineage diversity. Performance was strongly catchment dependent: metropolitan wastewater best represented broader community circulation, whereas hospital wastewater was noisier but detected KP.3 months before its appearance in routine metropolitan clinical surveillance. Across clinical datasets, hospitalisation-associated substitutions showed limited reproducibility, although the national and supranational analyses converged on receptor-binding-domain substitutions D405N, K417N and R408S. Networks showed coupling between G252V in NTD with those RBD substitutions involved in immune escape and receptor engagement. Finally, integrating regional to supranational GWAS with interaction networks and wastewater detection prioritised mutations supported by at least two independent association layers, that includes mutations in the Spike, especially in RBD, and the wastewater-exclusive candidate S:V445P, which was missed by contemporaneous clinical sequencing. 
Overall, WW genomics preferentially recovers the common mutational backbone of SARS-CoV-2 circulation and can highlight important changes missed by clinical sampling, making it a complementary tool for real-time prioritisation of viral evolutionary change. We found partial overlap in lineage composition between WW and clinical samples, with higher overlap at the metropolitan level (50%) than at the hospital level (30%). Conversely, we found a slightly higher overlap of individual mutations between WW and clinical samples at the hospital level (20%) than at the metropolitan level (16%), but shared mutations in both datasets were enriched in the Spike gene. Clade composition did not differ between 216 hospitalised and 528 non-hospitalised cases at regional level. Using GWAS and Hierarchical Lasso analysis, we detected mutations associated with hospitalization status in three different datasets: regional, national and worldwide, with little overlap between them. Although few variants replicated across cohorts, the overlap between the Spain and global analyses was statistically enriched and centred on RBD substitutions (D405N, K417N, R408S). Multiple integration of genomic association results prioritised 34/191 wastewater mutations (16 in Spike), including one mutation only detected in wastewater missed by routine clinical surveillance. Wastewater sequencing tracked dominant Omicron waves but performance varied by catchment; integrating clinical association results with interaction network modelling helped prioritise and interpret wastewater-detected mutations.

9
Flex-It: A global standardised genotyping framework for Shigella flexneri

Hawkey, J.; Nodari, C. S.; Iqbal, Z.; Hunt, M.; Wick, R. R.; Chong, C. E.; Jenkins, C.; Howden, B. P.; Holt, K.; Weill, F.-X.; Baker, K. S.; Ingle, D. J.

2026-04-20 microbiology 10.64898/2026.04.17.719127 medRxiv
Top 0.1%
21.7%
Shigella flexneri is the leading causative agent of shigellosis globally. The public health threat posed by S. flexneri is compounded by its emergence as a sexually transmissible infection, the importance of international travel in driving dissemination, and the increasing prevalence of antimicrobial resistance (AMR). A rapid and robust computational method, scalable and readily implementable across jurisdictions, is needed to enhance genomic surveillance and systematically explore features of the population structure of this WHO priority pathogen, particularly as vaccine development efforts are underway. Here, we present Flex-It, a genomic framework and genotyping scheme implemented in Mykrobe for S. flexneri serotypes 1-5, X & Y, compatible with previous approaches used to describe S. flexneri's population structure. To develop Flex-It, we curated a retrospective dataset of 5,819 publicly available S. flexneri genomes. We characterised the global population structure for S. flexneri, exploring geographical and temporal traits, and showed the granular diversity of AMR and serotype profiles. We applied Flex-It to >13,000 genomes routinely generated by public health laboratories from Australia, the UK and the USA across a ten-year period. We found significant genotype diversity in all three locations, with the emergence of genotypes with converged resistance to all major drugs currently used for treatment. Flex-It provides an open-source, novel genotyping method that rapidly characterises S. flexneri and its ciprofloxacin resistance determinants in <1 minute from both short and long whole-genome sequencing reads. Flex-It provides the community with a standardised nomenclature to monitor the emergence and spread of S. flexneri lineages.

10
RD-Embed: Unified representations of rare-disease knowledge from clinical records

Groza, T.; Tan, F.; Lim, N. T. R.; Shanmugasundar, M. W.; Kappaganthu, J.; Lieviant, J. A.; Karnani, N.; Chen, H.; Wong, T. Y.; Jamuar, S. S.

2026-04-04 genetic and genomic medicine 10.64898/2026.04.02.26350083 medRxiv
Top 0.1%
19.3%
Rare diseases often present with incomplete, evolving symptoms and signs scattered across clinical notes and coded records, making diagnosis and gene discovery difficult even when genomic data are available. Existing approaches either depend on curated phenotype profiles or use general biomedical language models that are not aligned to rare-disease knowledge, limiting performance in early or ambiguous clinical presentations. Here, we show that RD-Embed - a three-stage representation framework that builds a base space that preserves domain knowledge, aligns clinical text and SNOMED-derived signals, and refines relationships with graph-based learning - enables robust rare-disease retrieval from heterogeneous clinical records. Across ten rare-disease datasets, RD-Embed attains up to >50% top-ten diagnostic retrieval using combined text and phenotype features, compared with ~30% on average for other embedding models and similarly sized large language models. On an EHR stress test, clinical alignment substantially improves text-based retrieval compared with ontology-only representations, supporting use in routine EHR data. We suggest RD-Embed is a lightweight model that can be incorporated into existing hospital systems and that supports rare disease identification and diagnosis, and gene prioritization.

11
A Multi-Omics Computational Pipeline for Systematic Discovery of Retired Self-Antigens as Cancer Vaccine Targets

Wang, V.; Deng, S.; Aguilar, R.

2026-04-22 genetic and genomic medicine 10.64898/2026.04.20.26351288 medRxiv
Top 0.1%
19.0%
Background: The retired antigen hypothesis, introduced by Tuohy and colleagues, proposes that tissue-specific proteins expressed conditionally during early life or reproductive stages, then silenced in normal aging tissue, represent safe and effective cancer vaccine targets when re-expressed in tumors. To date, discovery of retired antigens has relied entirely on hypothesis-driven wet lab work, limiting throughput. Methods: Here we present RADAR (Retired Antigen Discovery and Ranking), a multi-omics computational pipeline implemented on a standard server that systematically identifies retired antigen candidates. RADAR comprises four core discovery layers integrating: 1) The Genotype-Tissue Expression Portal (GTEx) normal tissue expression, 2) TCGA tumor re-expression, 3) DNA methylation, and 4) miRNA regulatory networks, each applied sequentially to identify genes exhibiting the epigenetic and post-transcriptional hallmarks of tissue-specific retirement followed by tumor re-activation. Candidate characterization is further supported by three automated modules: 1) protein-level safety screening via the Human Protein Atlas, 2) molecular subtype enrichment analysis, and 3) cross-cancer confirmation, which execute automatically when the relevant data are available for the selected cancer type. Results: The pipeline independently validated known targets including alpha-lactalbumin (LALBA, the basis of the Tuohy Phase 1 triple-negative breast cancer vaccine trial) and anti-Mullerian hormone (AMH), consistent with Tuohy's ovarian cancer vaccine program targeting AMHR2, and rediscovered multiple known cancer-testis antigens (MAGEA1, MAGEC1, SSX1) as positive controls. Among 4,664 initial candidates derived from GTEx, the pipeline identified 20 high-confidence retired antigen candidates passing all filters. DCAF4L2, COX7B2, TEX19, and CT83 emerge as the highest-priority novel candidates for experimental validation, demonstrating zero expression in critical somatic organs, strong epigenetic silencing, and significant re-expression across multiple cancer types. Conclusion: RADAR provides the first systematic computational framework for retired antigen discovery, offering a reproducible and scalable approach to expanding the cancer immunoprevention pipeline beyond individually characterized targets. The pipeline is fully reproducible, requires no specialized hardware, and is immediately extensible to additional TCGA cancer types.

12
T-Rex: Standardized Analysis of Germline Variants in Whole-Exome Sequencing Trios

Reh, S.-L.; Walter, C.; Lohse, J.; Ghete, T.; Metzler, M.; Quante, A.; Hauer, J.; Auer, F.

2026-04-01 bioinformatics 10.64898/2026.03.30.715083 medRxiv
Top 0.1%
18.9%
Whole-exome sequencing (WES) enables the identification of rare germline variants contributing to pediatric diseases. Trio-based sequencing, comparing affected children with their parents, is particularly effective for rare disease genetics. However, WES data analysis requires bioinformatics expertise, varies across institutions, and is often incompatible with clinical workflows. We developed T-Rex (Trio Rare variant analysis of EXomes), a cross-platform desktop application that enables the standardized and local analysis of WES germline Trio data without the need for programming knowledge. T-Rex integrates state-of-the-art tools for alignment, dual-variant calling (GATK HaplotypeCaller + VarScan2), annotation (SNPEff/SNPSift), rare-variant filtering based on population frequencies (gnomAD), and family-based statistical testing, including the Transmission Disequilibrium Test with multiple-testing correction. Benchmarking of the dual-caller strategy on the Genome in a Bottle Ashkenazim Trio demonstrates high precision (99.2%) while maintaining robust sensitivity (91.1%). User testing (n=13) confirmed quick learning across clinicians and researchers. Application to a cohort of n=121 pediatric cancer Trio datasets, filtering for rare protein-coding variants (MAF ≤ 0.1% in gnomAD v4.0), validated all assessable previously reported pathogenic variants. Overall, T-Rex enables clinicians to robustly analyze WES Trio data in compliance with data protection regulations without requiring additional software licenses. As one of the first platforms for comprehensive WES Trio analysis that requires no programming expertise while providing clinical-grade, end-to-end workflows, T-Rex facilitates collaborative research between clinics and reduces reliance on external providers. Implementation and Availability: The source code is available on GitHub (https://github.com/SaraLuisaReh/trex). The fully precompiled app is available on Zenodo (https://zenodo.org/records/19135262).
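The two benchmark figures quoted above (precision 99.2%, sensitivity 91.1%) can be combined into a single F1 score, which is a common way to compare variant callers. The sketch below is a derived illustration only: the abstract does not report an F1 value, and the function name is hypothetical.

```python
# Sketch only: harmonic mean of precision and recall (sensitivity).
# The inputs are the values quoted in the abstract; the F1 itself is
# derived here and is not a figure reported by the authors.

def f1_score(precision: float, recall: float) -> float:
    """Return the F1 score, or 0.0 when both inputs are zero."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(f"F1 = {f1_score(0.992, 0.911):.3f}")
```

With these inputs the harmonic mean comes out at roughly 0.95. It sits closer to the lower of the two figures, which is why a dual-caller strategy tuned for precision is penalised for any sensitivity it gives up.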

13
Ancestry-stratified variant classification in monogenic diabetes genes: annotation coverage and differential curation burden

Dario, P.

2026-04-07 genetic and genomic medicine 10.64898/2026.04.06.26350230 medRxiv
Top 0.1%
18.8%
Variant databases ClinVar and gnomAD are the backbone of clinical variant interpretation, but their population composition is skewed toward European ancestry. Whether this skew creates systematic classification disadvantages for non-European patients with monogenic diabetes has not been examined at the database level. ClinVar variant_summary (GRCh38, April 2026; 4,421,188 variants) was cross-referenced with gnomAD v4.0 genome data for 17 monogenic diabetes genes. Annotation coverage and variant classification rates were computed stratified by genetic ancestry group (AFR, AMR, EAS, SAS, MID, NFE, FIN, ASJ). Of 14,691 gnomAD variants across the 17 genes, only 29.7% had any ClinVar classification (range: 12.7%-61.3% by gene). Among classified variants, non-Finnish European (NFE) variants had the highest variant of uncertain significance (VUS) rate (32.1%) and the lowest benign/likely benign fraction (41.6%), consistent with a large submission volume without functional follow-up. African-ancestry (AFR) variants showed the second-highest VUS rate (29.2%), not statistically distinguishable from NFE after Bonferroni correction, while all other non-European groups had significantly lower rates (all p < 0.001). GCK showed a pattern inversion - non-European VUS rate (18.5%) exceeding European (15.0%) - consistent with progressive reclassification in European populations absent in non-European cohorts. Annotation coverage and VUS divergence were uncorrelated (r = -0.15, p = 0.57). The primary equity problem is a 70% annotation gap combined with a non-European curation deficit, not a simple VUS excess. Ancestry-stratified evaluation of ClinGen Variant Curation Expert Panel (VCEP) criteria performance is warranted across disease domains.

14
The human pangenome reference reduces ancestry-related biases in somatic mutation detection

Pham, C. V. K.; Abdelmalek, F. S. A.; Hua, T.; Apel, E.; Bizjak, A.; Schmidt, E. J.; Houlahan, K. E.

2026-04-01 bioinformatics 10.64898/2026.03.30.715289 medRxiv
Top 0.1%
18.5%
Commonly used human reference genomes collapse extensive genetic variability into a single linear genome of which 70% is derived from one donor. These linear genomes fail to capture the full spectrum of genetic variation, which can lead to misalignment of sequencing reads particularly for individuals underrepresented by the linear reference genomes. To address this shortcoming, the Human Pangenome Reference Consortium released the first draft of the human pangenome reference, a graph-based reference that integrates diverse haplotypes. While the human pangenome reference has shown increased accuracy in detecting inherited DNA variants, it remains to be seen if the observed improvements extend to somatic mutation detection. Here, we systematically benchmarked somatic single nucleotide variant (SNV) detection leveraging the human pangenome in 30 whole exome sequenced bladder tumours with matched blood tissue of diverse ancestries. We found somatic SNV detection leveraging the human pangenome reference outperformed the linear reference, most notably in individuals of East Asian ancestry where we observed on average a 20% improvement in detection accuracy. Improvements to detection accuracy in individuals of European ancestry were marginal. The increase in accuracy was attributed to reduced germline contamination and reduced reference bias. Further, we demonstrate the pangenome increases SNV detection precision, mitigating the need for time and computationally expensive ensemble approaches that take the consensus across multiple tools. Finally, we demonstrate that the increased precision when aligned to the pangenome generalized to an additional 29 lung adenocarcinoma tumours, particularly for individuals of East Asian ancestry. These findings support adoption of the pangenome to improve somatic variant detection and reduce ancestry-related disparities.

15
iCNG99: a validated genome-scale metabolic model of Cryptococcus neoformans strain H99

Feng, C.; Hu, P.; Zhu, Y.; Ke, W.; Gao, X.; Ding, C.; Zhai, B.; Wang, L.; Dai, Z.

2026-04-15 systems biology 10.64898/2026.04.12.718001 medRxiv
Top 0.1%
18.4%
Cryptococcus neoformans is a ubiquitous environmental fungus that can also cause life-threatening infections in immunocompromised individuals. As a competent pathogen, Cryptococcus needs to reprogram its metabolism to adapt to the drastic differences between environmental niches and host niches. A well-curated genome-scale metabolic model (GEM) is a powerful tool to facilitate the investigation of the metabolic resilience of an organism. Here we reconstructed and validated iCNG99, a GEM for C. neoformans reference strain H99, and evaluated its predictive performance across 43 growth conditions and gene essentiality benchmarks. The model achieved high confidence essential gene prediction (precision = 0.77) and recapitulated pathways targeted by clinically available antifungals. Integration with transcriptomic and metabolomic data enabled iCNG99 to capture condition-specific metabolic adaptations and to identify candidate vulnerabilities in drug tolerance, revealing metabolic adaptations associated with survival within host conditions and drug susceptibility. Together, iCNG99 provides a systems-level computational platform for investigating C. neoformans metabolism and for prioritizing antifungal vulnerabilities.

16
A long-read RNA sequencing and polysome profiling framework reveals transposable element-driven transcript diversity and translational rewiring in glioblastoma

Pizzagalli, M.; Sasipalli, S.; Leary, O.; Tran, L.; Haas, B.; Tapinos, N.

2026-04-21 cancer biology 10.64898/2026.04.18.719388 medRxiv
Top 0.1%
18.2%
Background: Transposable elements (TEs) account for over half of the human genome and are often derepressed in cancer. TEs can add cryptic splice sites, undergo exonization, and generate gene-TE fusion transcripts, but the combined effects of TEs on RNA processing and translation in glioblastoma stem cells (GSCs) remain incompletely elucidated. Results: We combined long-read RNA sequencing with polysome profiling in four patient-derived GSCs and two neural stem cell (NSC) controls to resolve TE-associated transcript diversity and its relationship to ribosomal engagement. Across GSCs, we identified 13,421 alternative splicing (AS) events, 3,077 of which contained TEs within 150 bp of splice junctions. AS sites proximal to TEs were associated with increased isoform switching compared to non-TE-associated AS sites (odds ratio 2.9-4.3). Moreover, AS isoforms generated from TE-proximal sites were more likely to exhibit altered ribosomal association (odds ratio 2.54). Directional shifts were observed, with shorter isoforms associating with monosome fractions and longer isoforms with polysome fractions. To enable systematic detection of gene-TE chimeric transcripts, we developed FuTER (Fusion TE Reporter), a long-read-based framework for identifying TE-associated fusions. Application to GSC datasets identified 78 GSC-enriched fusion transcripts, several supported by breakpoint-spanning reads in polysome fractions, consistent with ribosome association. Conclusions: Our data suggest that TEs correlate with abnormal splicing activity and altered ribosome engagement in glioblastoma stem cells. By integrating long-read sequencing with polysome profiling and fusion detection, we establish a framework for analysis of TE-induced transcript diversity and its effects on cancer evolution and plasticity.

17
Combining mutation detection with fragmentomics features leads to improved tumor-informed ctDNA detection

Lin, Y.; Oroperv, C.; Frydendahl, A.; Rasmussen, M. H.; Andersen, C. L.; Besenbacher, S.

2026-04-01 bioinformatics 10.64898/2026.03.30.714025 medRxiv
Top 0.1%
18.2%

Liquid biopsy through circulating tumor DNA (ctDNA) analysis enables non-invasive detection of minimal residual disease (MRD) and early identification of cancer relapse, facilitating timely clinical intervention. However, detecting ctDNA in plasma samples with low tumor burden remains challenging due to the scarcity of mutant molecules and the background noise from sequencing errors and somatic mutations in normal cell-free DNA (cfDNA). Here, we present a mutation-informed fragmentomic framework and evaluate it on 90 stage III colorectal cancer patients with three years of follow-up. Using 712 serial whole-genome sequenced cfDNA samples (30x) with matched whole-genome sequencing of tumor tissue and germline DNA from buffy coat for each patient, we collected cfDNA fragments spanning tumor-derived somatic mutation positions and compared fragmentomic characteristics of mutation-bearing and non-mutated cfDNA fragments within the same sample. By leveraging fragment length and fragment end-motif patterns, our approach can distinguish cancer-positive from cancer-negative plasma samples without requiring model training or panel-of-normals calibration. The method achieved AUCs of 0.863 and 0.74 using fragment length and end motif features, respectively, and 0.871 when combined, outperforming tumor fraction estimates based on the frequency of mutated fragments (AUC=0.832). Integrating fragmentomic features with tumor fraction further improved performance, yielding an AUC of 0.873, indicating complementary signals between fragmentomic patterns and mutation burden. Aggregated analyses revealed ctDNA-specific patterns, including fragment shortening, motif enrichment of A/T ends, and depletion of C/G ends, directly linking fragmentomic features to tumor-derived cfDNA.
Overall, mutation-informed fragmentomic profiling improves ctDNA detection beyond counting mutant reads and provides a scalable, training-free strategy for MRD assessment and early relapse detection while offering mechanistic insights into tumor-specific cfDNA biology.

18
Resolution of the D4Z4 repeat responsible for facioscapulohumeral muscular dystrophy with HiFi sequencing

Chen, X.; Lemmers, R. J. L. F.; Kronenberg, Z.; Devaney, J. M.; Noya, J.; Berlyoung, A. S.; Yusuff, S.; Lynch, S.; Nykamp, K.; Lyndy, A. S.; Dolzhenko, E.; van der Maarel, S. M.; Eberle, M. A.

2026-04-14 genetics 10.64898/2026.04.10.717730 medRxiv
Top 0.2%
18.1%

The D4Z4 macrosatellite repeat encompasses some of the most difficult-to-resolve disease-related variations in the human genome. D4Z4 has a repeat unit of 3.3 kb (encoding the DUX4 gene) that is present in up to 100 copies on two chromosomes (4 and 10), while DUX4 can only be expressed in somatic cells from the permissive A haplotype that usually occurs on chromosome 4. Facioscapulohumeral muscular dystrophy (FSHD) is caused by chromatin relaxation and ectopic expression of DUX4 in skeletal muscle, mediated by contraction of D4Z4 to 1-10 copies (FSHD1, 95% of FSHD cases) or mutations in chromatin factor genes such as SMCHD1 (FSHD2, 5% of FSHD cases). Due to its large size, disease-specific haplotypes, and sequence homology between chromosomes, D4Z4 is challenging to resolve by current sequencing technologies. We report a computational tool, Kivvi, to genotype D4Z4 using PacBio whole-genome long-read sequence data. Kivvi detects all D4Z4 alleles in a sample, reporting the repeat size, chromosome (4 vs. 10), distal haplotype (A vs. non-permissive haplotypes) and the methylation level of each allele. We validated Kivvi against gold-standard assays for FSHD diagnostics, detecting 100% of contracted alleles and correctly classifying 90% of noncontracted alleles. We showed differential methylation signals between FSHD1 and candidate FSHD2 samples. We profiled D4Z4 across 601 individuals from five ancestral populations, revealing extensive genetic diversity. We identified common haplotypes of D4Z4 alleles and characterized hybrid repeat units, hybrid repeat arrays, and translocation alleles. Combined with HiFi long reads, Kivvi enables the consolidation of multiple FSHD assays into a single workflow and facilitates the discovery of novel genetic modifiers of FSHD through population-scale studies.

19
Long-read analysis of tetrameric microsatellites with vmwhere supports GGAA repeat length-dependent chromatin state association in Ewing sarcoma

Peterson, S. K.; Massie, A. M.; Rubinsteyn, A.; Wang, J. R.; Davis, I. J.

2026-04-10 cancer biology 10.64898/2026.04.08.717017 medRxiv
Top 0.2%
17.9%

Microsatellites are abundant genomic elements that contribute to genetic diversity and disease-associated regulatory variation. Although long-read sequencing enables accurate resolution of repetitive regions, computational methods for fully resolved microsatellite genotyping remain limited. Here, we introduce variant motif where (vmwhere), a computational framework for identifying, genotyping, decomposing, and visualizing complex tetrameric microsatellites from long-read sequencing data. Using simulated error-free reads, vmwhere accurately measures several genotyping metrics, including allele length, repeat length, maximum consecutive repeat length, and motif density. Applied to long-read whole-genome sequencing data, vmwhere identified sequence interruptions, motif-specific differences in repeat architecture, and ancestry-associated allele variation, including long repeat alleles that exceed short-read sequencing limitations. We applied vmwhere to GGAA microsatellites in Ewing sarcoma, an aggressive pediatric cancer driven by the EWS-FLI1 fusion oncoprotein, which binds to microsatellites and remodels chromatin. Genome-wide integration of long-read-defined microsatellite architecture with chromatin accessibility and EWS-FLI1 binding revealed that GGAA repeat structure was associated with chromatin state, with longer consecutive repeat microsatellites exhibiting increased EWS-FLI1 binding and chromatin accessibility. Cell line-specific expansions and contractions of GGAA microsatellite repeat length were associated with gains and losses of chromatin accessibility. Further, we identified haplotype-specific chromatin states, with preferential binding and accessibility at longer alleles. Together, these results establish vmwhere as a scalable framework for resolving population-level microsatellite variation and linking repeat architecture to chromatin state.
Repeat structure and length characteristics provide insights into genotype-function relationships at microsatellite repeats in cancer.

20
From GWAS to drug: A framework for drug candidate prioritisation using a gene expression signature matching approach

Chauquet, S.; Jiang, J.-C.; Barker, L. F.; Hunter, Z. L.; Singh, G.; Wray, N. R.; McRae, A. F.; Shah, S.

2026-04-24 genetic and genomic medicine 10.64898/2026.04.22.26349470 medRxiv
Top 0.2%
17.2%

Drug targets supported by human genetic evidence have significantly higher approval rates, making genome-wide association studies (GWAS) a valuable resource for drug candidate prioritisation. Transcriptome-wide association study (TWAS) signature-matching is an emerging in silico approach that integrates GWAS data with expression quantitative trait loci (eQTL) to generate a disease gene expression signature, which is then compared against drug perturbation databases such as the Connectivity Map (CMap). Despite recent adoption, there is no consensus on optimal methodology. Here, we systematically benchmark key parameters, including TWAS method, eQTL tissue model, similarity metric, gene set size, and CMap cell line, using LDL cholesterol, familial combined hyperlipidemia, and asthma as proof-of-concept traits. We demonstrate that while TWAS signature-matching can successfully prioritise known first-line treatments, performance is highly sensitive to parameter choice; for instance, the selection of the cell line used for drug signatures alone can dramatically alter drug prioritisation. Based on these findings, we propose a best-practice framework for robust, genetically informed drug prioritisation using TWAS signature-matching.